Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation #45234

Closed

Conversation

erenavsarogullari
Copy link
Member

@erenavsarogullari erenavsarogullari commented Feb 23, 2024

What changes were proposed in this pull request?

AQE can materialize both ShuffleQueryStage and BroadcastQueryStage on the cancellation. This causes unnecessary stage materialization by submitting Shuffle Job and Broadcast Job. Under normal circumstances, if the stage is already non-materialized (a.k.a ShuffleQueryStage.shuffleFuture or BroadcastQueryStage.broadcastFuture is not initialized yet), it should just be skipped without materializing it.

Problematic Stacktrace:

at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.submitShuffleJob(ShuffleExchangeExec.scala:104)
at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.shuffleFuture$lzycompute(QueryStageExec.scala:210)
at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.shuffleFuture(QueryStageExec.scala:210)
at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.cancel(QueryStageExec.scala:223)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$cleanUpAndThrowException$1(AdaptiveSparkPlanExec.scala:905)

Please find sample use-case:
1- Stage Materialization Steps:
When stage materialization is failed:

1.1- ShuffleQueryStage1 - is materialized successfully,
1.2- ShuffleQueryStage2 - materialization is failed,
1.3- ShuffleQueryStage3 - Not materialized yet so ShuffleQueryStage3.shuffleFuture is not initialized yet

2- Stage Cancellation Steps:

2.1- ShuffleQueryStage1 - is canceled due to already materialized,
2.2- ShuffleQueryStage2 - is earlyFailedStage so currently, it is skipped as default by AQE because it could not be materialized,
2.3- ShuffleQueryStage3 - Problem is here: This stage is not materialized yet but currently, it is also tried to cancel and this stage requires to be materialized first.

Reproduce Steps:
https://github.com/apache/spark/pull/45234/files#diff-f89f2fe78b324c6bc7190bef84220181f3616efc156ea99b3f15d375a22d7f88R900

Why are the changes needed?

Current logic introduces unnecessary Shuffle Job / Broadcast Job to be able to cancel ShuffleQueryStage / BroadcastQueryStage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new Unit Tests

Was this patch authored or co-authored using generative AI tooling?

No

@erenavsarogullari erenavsarogullari changed the title [SPARK-47148][SQL] Avoid to materialize AQE ShuffleQueryStage on the cancellation [SPARK-47148][SQL] Avoid to materialize AQE QueryStages on the cancellation Feb 29, 2024
@erenavsarogullari erenavsarogullari changed the title [SPARK-47148][SQL] Avoid to materialize AQE QueryStages on the cancellation [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation Feb 29, 2024
@erenavsarogullari
Copy link
Member Author

Should we also backport this patch to v3.4.x and v3.5.x?

@erenavsarogullari erenavsarogullari force-pushed the SPARK-47148 branch 2 times, most recently from 53dd089 to a7a869f Compare March 1, 2024 23:35
@ulysses-you
Copy link
Contributor

cc @cloud-fan @maryannxue as well

@erenavsarogullari erenavsarogullari force-pushed the SPARK-47148 branch 2 times, most recently from 355aeb0 to a923c2a Compare March 7, 2024 03:23
@erenavsarogullari erenavsarogullari force-pushed the SPARK-47148 branch 2 times, most recently from 867134b to edf95b3 Compare March 18, 2024 23:40
@erenavsarogullari erenavsarogullari force-pushed the SPARK-47148 branch 3 times, most recently from c178f2e to d0e4127 Compare March 30, 2024 22:54
@erenavsarogullari erenavsarogullari force-pushed the SPARK-47148 branch 2 times, most recently from aaafa1d to 4378034 Compare April 20, 2024 01:02
@erenavsarogullari
Copy link
Member Author

Thanks @cloud-fan and @ulysses-you for the reviews and approval.
I have just rebased to get green build.

@erenavsarogullari
Copy link
Member Author

Build is green now so PR is ready to be merged. Thanks in advance.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d913d1b Apr 29, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… the cancellation

### What changes were proposed in this pull request?
AQE can materialize both `ShuffleQueryStage` and `BroadcastQueryStage` on the cancellation. This causes unnecessary stage materialization by submitting Shuffle Job and Broadcast Job. Under normal circumstances, if the stage is already non-materialized (a.k.a `ShuffleQueryStage.shuffleFuture` or `BroadcastQueryStage.broadcastFuture` is not initialized yet), it should just be skipped without materializing it.

**Problematic Stacktrace:**
```
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.submitShuffleJob(ShuffleExchangeExec.scala:104)
at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.shuffleFuture$lzycompute(QueryStageExec.scala:210)
at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.shuffleFuture(QueryStageExec.scala:210)
at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.cancel(QueryStageExec.scala:223)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$cleanUpAndThrowException$1(AdaptiveSparkPlanExec.scala:905)
```

Please find sample use-case:
**1- Stage Materialization Steps:**
When stage materialization is failed:
```
1.1- ShuffleQueryStage1 - is materialized successfully,
1.2- ShuffleQueryStage2 - materialization is failed,
1.3- ShuffleQueryStage3 - Not materialized yet so ShuffleQueryStage3.shuffleFuture is not initialized yet
```
**2- Stage Cancellation Steps:**
```
2.1- ShuffleQueryStage1 - is canceled due to already materialized,
2.2- ShuffleQueryStage2 - is earlyFailedStage so currently, it is skipped as default by AQE because it could not be materialized,
2.3- ShuffleQueryStage3 - Problem is here: This stage is not materialized yet but currently, it is also tried to cancel and this stage requires to be materialized first.
```
**Reproduce Steps:**
https://github.com/apache/spark/pull/45234/files#diff-f89f2fe78b324c6bc7190bef84220181f3616efc156ea99b3f15d375a22d7f88R900

### Why are the changes needed?
Current logic introduces unnecessary Shuffle Job / Broadcast Job to be able to cancel `ShuffleQueryStage` / `BroadcastQueryStage`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added new Unit Tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#45234 from erenavsarogullari/SPARK-47148.

Authored-by: erenavsarogullari <erenavsarogullari@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
LuciferYang pushed a commit that referenced this pull request Jun 18, 2024
### What changes were proposed in this pull request?

A followup of #45234 to make the test more stable by using broadcast hint.

### Why are the changes needed?

test improvement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47007 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
* such as waiting for the subqueries.
*/
@transient private lazy val shuffleFuture: Future[MapOutputStatistics] = executeQuery {
materializationStarted.set(true)
Copy link
Contributor

@cloud-fan cloud-fan Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a closer look, I don't think this change works as we expect. We set this materializationStarted flag before we return the Future, which means we are still on the AQE loop's main thread. That said, once we submit a query stage, its materializationStarted becomes true immediately and we can't really avoid the wasted query stage execution when cancelling it.

The test passed because ShuffleExchangeExec calls child.execute() before returning the Future. Then we exit the AQE loop without cancelling other stages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further more, I don't think this idea works. Let's say when we want to cancel a query stage and the materializationStarted flag is false, we decide to skip the cancellation but maybe the next second the materializationStarted becomes true and we miss to cancel the shuffle job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a bit of synchronization here. The shuffle node should have two fields: isCancelled flag and the shuffle job Future.

  • When we cancel a shuffle, we lock on the shuffle node, and set isCancelled flag to true. Then if the shuffle job Future is present, we cancel it.
  • When we are going to submit a shuffle, we lock on the shuffle node. Then: if the isCancelled flag is true, fail immediately, otherwise, submit the shuffle job and set the Future field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems not an acutally issue for now. AQE always do materilize stage and cancel stage at main thread. So if we decide to cancel stage then that means we will never do materilize stage again. We may need improve this code if we support do materilize concurrently in future.

cloud-fan added a commit that referenced this pull request Aug 5, 2024
…ages

### What changes were proposed in this pull request?

We missed the fact that submitting a shuffle or broadcast query stage can be heavy, as it needs to submit subqueries and wait for the results. This blocks the AQE loop and hurts the parallelism of AQE.

This PR fixes the problem by using shuffle/broadcast's own thread pool to wait for subqueries and other preparations.

This PR also re-implements #45234 to avoid submitting the shuffle job if the query is failed and all query stages need to be cancelled.

### Why are the changes needed?

better parallelism for AQE

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test case

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47533 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
fusheng-rd pushed a commit to fusheng-rd/spark that referenced this pull request Aug 6, 2024
…ages

### What changes were proposed in this pull request?

We missed the fact that submitting a shuffle or broadcast query stage can be heavy, as it needs to submit subqueries and wait for the results. This blocks the AQE loop and hurts the parallelism of AQE.

This PR fixes the problem by using shuffle/broadcast's own thread pool to wait for subqueries and other preparations.

This PR also re-implements apache#45234 to avoid submitting the shuffle job if the query is failed and all query stages need to be cancelled.

### Why are the changes needed?

better parallelism for AQE

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test case

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47533 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants